Abstract

Digital video is rapidly becoming important for education, entertainment, and a host of
multimedia applications. With the size of video collections growing to thousands of hours,
technology is needed to effectively browse segments in a short time without losing the content
of the video. We propose a method to extract the significant audio and video information and
create a “skim” video which represents a very short synopsis of the original. The goal of this
work is to show the utility of integrating language and image understanding techniques for video
skimming by extraction of significant information, such as specific objects, audio keywords and
relevant video structure. The resulting skim video is much shorter, with compaction as high as
20:1, and yet retains the essential content of the original segment.
Projects Agency. Michael Smith is sponsored by Bell Laboratories. The views and conclusions
representing official policies or endorsements, either expressed or implied, of the United States
Government or Bell Laboratories.
1 Introduction

With increased computing power and electronic storage capacity, the potential for large digital
video libraries is growing rapidly. These libraries, such as the Informedia™ Project at Carnegie
Mellon [7], will make thousands of hours of video available to a user. For many users, the video
of interest is not always a full-length film. Unlike video-on-demand, video libraries should
provide informational access in the form of brief, content-specific segments as well as
full-featured videos.

Even with intelligent content-based search algorithms being developed [5], [11], multiple
video segments will be returned for a given query to ensure retrieval of pertinent information. The
users will often need to view all the segments to obtain their final selections. Instead, the user will
want to “skim” the relevant portions of video for the segments related to their query.

Browsing Digital Video 

Simplistic browsing techniques, such as fast-forward playback and skipping video frames at
fixed intervals, reduce video viewing time. However, fast playback perturbs the audio and distorts
much of the image information, and displaying video sections at fixed intervals merely gives a
random sample of the overall content. Another idea is to present a set of “representative” video
frames (e.g. keyframes in motion-based encoding) simultaneously on a display screen. While
useful and effective, such static displays miss an important aspect of video: video contains audio
information. It is critical to use and present audio information, as well as image information, for
browsing. Recently, researchers have proposed browsing representations based on information
within the video [8], [9], [10]. These systems rely on the motion in a scene, placement of scene
breaks, or image statistics, such as color and shape, but they do not make integrated use of image
and language understanding.

An ideal browser would display only the video pertaining to a segment's content, suppressing
irrelevant data. It would show less video than the original and could be used to sample many seg- 
ments without viewing each in its entirety. The amount of content displayed should be adjustable 
so the user can view as much or as little video as needed, from extremely compact to full-length 
video. The audio portion of this video should also consist of the significant audio or spoken 
words, instead of simply using the synchronized portion corresponding to the selected video 
frames. 

Video Skims 

Figure 1 illustrates the concept of extracting the most representative video frames and audio
information to create the skim. The critical aspect of compacting a video is context understanding,
which is the key to choosing the “significant images and words” that should be included in the
skim video. We characterize the significance of video through the integration of image and
language understanding. Segment breaks produced by image processing can be examined along with
boundaries of topics identified by the language processing of the transcript. The relative
importance of each scene can be evaluated by 1) the objects that appear in it, 2) the associated
words, and 3) the structure of the video scene. The integration of language and image understanding
is needed to realize this level of characterization and is essential to skim creation.

Figure 2: Video Characterization Technology. Video is segmented into scenes, and camera motion is
detected along with significant objects (faces and text). Bars show frames with positive results.

In the sections that follow, we describe the technology involved in video characterization
from audio and images embedded within the video, and the process of integrating this information 
for skim creation. 

2 Video Characterization 

Through techniques in image and language understanding, we can characterize scenes,
segments, and individual frames in video. Figure 2 illustrates characterization of a segment taken
from a video titled “Destruction of Species”, from WQED Pittsburgh. At the moment, language
understanding entails identifying the most significant words in a given scene, and image
understanding entails segmentation of video into scenes, detection of objects of importance
(face and text) and identification of the structural motion of a scene.

2.1 Language Characterization 

Language analysis works on the transcript to identify important audio regions known as
“keywords”. We use the well-known technique of TF-IDF (Term Frequency Inverse Document
Frequency) to measure the relative importance of words for the video document [5]:

$$\text{TF-IDF} = \frac{f_s}{f_c} \qquad (1)$$

The TF-IDF of a word is its frequency in a given scene, $f_s$, divided by the frequency, $f_c$,
of its appearance in a standard corpus. Words that appear often in a particular segment, but
relatively infrequently in a standard corpus, receive the highest TF-IDF weights. A threshold is
set to extract keywords from the TF-IDF weights, as shown in the bottom rows of Figure 2.
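
The keyword weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `tfidf_keywords`, the `floor` guard for words missing from the corpus table, the toy corpus frequencies, and the threshold value are all assumptions made for the example.

```python
from collections import Counter

def tfidf_keywords(scene_words, corpus_freq, threshold, floor=1e-6):
    """Return {word: weight} for words whose f_s / f_c exceeds `threshold`.

    scene_words : list of words spoken in one scene
    corpus_freq : dict mapping word -> relative frequency in a standard corpus
    floor       : assumed lower bound on f_c so unseen words do not divide by zero
    """
    counts = Counter(w.lower() for w in scene_words)
    total = sum(counts.values())
    keywords = {}
    for word, count in counts.items():
        f_s = count / total                            # frequency within the scene
        f_c = max(corpus_freq.get(word, 0.0), floor)   # frequency in the corpus
        weight = f_s / f_c
        if weight > threshold:
            keywords[word] = weight
    return keywords

# Toy data: corpus frequencies are invented for illustration only.
scene = "the species are dying and the species cannot recover".split()
corpus = {"the": 0.05, "are": 0.01, "and": 0.03, "species": 0.0001,
          "dying": 0.0002, "cannot": 0.001, "recover": 0.0005}
print(sorted(tfidf_keywords(scene, corpus, threshold=300)))
```

Raising `threshold` keeps fewer words, which is the same lever the skim uses later to control its length.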



2.2 Scene Segmentation

Many research groups have developed working techniques for detecting scene changes [8],
[3], [9]. We choose to segment video by the use of a comparative color histogram difference
measure. By detecting significant changes in the weighted color histogram of each successive frame,
video sequences are separated into scenes:

$$D(t) = \sum_{v=0}^{N} \left| H_t(v) - H_{t-1}(v) \right|$$

where $H_t(v)$ is the histogram of color $v$ in image $t$. Peaks in the difference, $D(t)$, are
detected and an empirically set threshold is used to select scene breaks. We have found that this
technique is simple, and yet robust enough to maintain high levels of accuracy for our purpose.
Using this technique, we have achieved 91% accuracy in scene segmentation on a test set of roughly
495,000 images (5 hours). Examples of segmentation results are shown in the top row of Figure 2.
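
As a rough illustration of histogram-based scene segmentation, the sketch below computes a frame-to-frame histogram difference and reports local peaks above a threshold. Several details are assumptions rather than the paper's method: frames are flat lists of single-channel intensities, the histogram is unweighted, the difference is a sum of absolute bin differences, and the bin count and threshold are arbitrary.

```python
def histogram(frame, bins=8, max_value=256):
    """Count the pixel values of a flat list of intensities into `bins` buckets."""
    hist = [0] * bins
    for value in frame:
        hist[value * bins // max_value] += 1
    return hist

def histogram_differences(frames, bins=8):
    """Return D(t) for t = 1 .. len(frames) - 1 as summed absolute bin differences."""
    hists = [histogram(f, bins) for f in frames]
    return [sum(abs(a - b) for a, b in zip(hists[t], hists[t - 1]))
            for t in range(1, len(hists))]

def scene_breaks(diffs, threshold):
    """Frame numbers where D(t) exceeds `threshold` and is not below either neighbour."""
    breaks = []
    for i, d in enumerate(diffs):
        left = diffs[i - 1] if i > 0 else float("-inf")
        right = diffs[i + 1] if i + 1 < len(diffs) else float("-inf")
        if d > threshold and d >= left and d >= right:
            breaks.append(i + 1)  # diffs[i] compares frame i + 1 with frame i
    return breaks

# Three dark frames followed by two bright frames: one break at frame 3.
dark, bright = [10] * 16, [240] * 16
frames = [dark, dark, dark, bright, bright]
diffs = histogram_differences(frames)
print(diffs, scene_breaks(diffs, threshold=16))
```

In practice the threshold would be tuned empirically on labelled footage, as the text describes.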

2.3 Camera Motion Analysis 

One important aspect of video characterization is interpretation of camera motion. The global 
distribution of motion vectors distinguishes between object motion and actual camera motion. 
Object motion typically exhibits flow fields in specific regions of an image. Camera motion is
characterized by flow throughout the entire image. 

Motion vectors for each 16x16 block are available with little computation in the MPEG-1 video
standard [13]. An affine model is used to approximate the flow patterns consistent with all types
of camera motion. Affine parameters a, b, c, d, e, and f are calculated by minimizing the least
squares error of the motion vectors. We also compute the average flow $\bar{v}$ and $\bar{u}$.

Using the affine flow parameters and average flow, we classify the flow pattern. To determine
if a pattern is a zoom, we first check if there is a convergence or divergence point $(x_0, y_0)$,
where $u(x_0, y_0) = 0$ and $v(x_0, y_0) = 0$. To solve for $(x_0, y_0)$, the following relation
must be true:

$$\begin{vmatrix} a & b \\ d & e \end{vmatrix} \neq 0$$

If the above relation is true, and $(x_0, y_0)$ is located inside the image, then it must represent
the focus of expansion. If $\bar{v}$ and $\bar{u}$ are large, then this is the focus of the flow and
the camera is zooming. If $(x_0, y_0)$ is outside the image, and $\bar{v}$ or $\bar{u}$ are large,
then the camera is panning in the direction of the dominant vector.

If the above determinant is approximately 0, then $(x_0, y_0)$ does not exist and the camera is
panning or static. If $\bar{v}$ or $\bar{u}$ are large, the motion is panning in the direction of the
dominant vector. Otherwise, there is no significant motion and the flow is static. We eliminate
fragmented motion by averaging the results in a 20 frame window over time. Table 1 shows the
statistics for detection on various sets of images. Regions detected are either pans or zooms.
Examples of the camera motion analysis results are shown in Figure 3.

Table 1: Camera Motion Detection Results

Figure 3: Camera motion analysis from MPEG motion vectors: A) Zoom distribution, B) Upward
pan with subtle object motion, C) Static, D) Significant object motion detected as pan.

Figure 4: Detection of human faces.
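
The classification logic of this section can be sketched as follows. The sketch rests on several assumptions that the text does not confirm: the affine model is taken to be u = ax + by + c and v = dx + ey + f, the "average flow" is treated as mean absolute flow in each direction, the thresholds `det_eps` and `flow_thresh` are invented, and the 20 frame "averaging" is implemented as a majority vote over labels. Pan direction is omitted.

```python
from collections import Counter

def classify_flow(params, mean_abs_u, mean_abs_v, width, height,
                  det_eps=1e-6, flow_thresh=1.0):
    """Label one frame's flow as 'zoom', 'pan' or 'static' from affine parameters."""
    a, b, c, d, e, f = params
    both_large = mean_abs_u > flow_thresh and mean_abs_v > flow_thresh
    any_large = mean_abs_u > flow_thresh or mean_abs_v > flow_thresh
    det = a * e - b * d
    if abs(det) > det_eps:
        # Solve a*x + b*y = -c and d*x + e*y = -f with Cramer's rule.
        x0 = (b * f - c * e) / det
        y0 = (c * d - a * f) / det
        inside = 0 <= x0 < width and 0 <= y0 < height
        if inside:
            return "zoom" if both_large else "static"
        return "pan" if any_large else "static"
    # Determinant near zero: no convergence or divergence point exists.
    return "pan" if any_large else "static"

def smooth_labels(labels, window=20):
    """Replace each label by the most common label in a window centred on it."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half)
        smoothed.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return smoothed

# A zoom centred at (160, 120) in a 320x240 image, a uniform pan, and no motion.
print(classify_flow((0.1, 0, -16, 0, 0.1, -12), 8.0, 6.0, 320, 240))
print(classify_flow((0, 0, 5, 0, 0, 0), 5.0, 0.0, 320, 240))
print(classify_flow((0, 0, 0, 0, 0, 0), 0.0, 0.0, 320, 240))
```

The smoothing step removes isolated misclassifications, for example a single "static" frame inside a long pan.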


2.4 Object Detection 

Identifying significant objects that appear in the video frames is one of the key components 
for video characterization. For the time being, we have chosen to deal with two of the more inter- 
esting objects in video: human faces and text (caption characters). To reduce computation we 
detect text and faces every 15th frame.

Face Detection

The “talking head” image is common in interviews and news clips, and illustrates a clear
example of video production focussing on an individual of interest. A human interacting within
an environment is also a common theme in video. The human-face detection system used for our
experiments was developed by Rowley, Baluja and Kanade [6]. It detects mostly frontal faces of
any size and any background. Its current performance level is to detect over 86% of more than
507 faces contained in 130 images, while producing approximately 63 false detections. While
improvement is needed, the system can detect faces of varying sizes and is especially reliable
with frontal faces such as talking-head images. Figure 4 shows examples of its output, illustrating
the range of face sizes that can be detected.
Figure 5: Stages of text detection: A) Input, B) Filtering, C) Clustering, and D) Region Extraction.


Text Detection

Text in the video provides significant information as to the content of a scene. For example, 
statistical numbers and titles are not usually spoken but are included in the captions for viewer 
inspection. A typical text region can be characterized as a horizontal rectangular structure of clus- 
tered sharp edges, because characters usually form regions of high contrast against the back- 
ground. By detecting these properties we extract regions from video frames that contain textual 
information. Figure 5 illustrates the process of detecting text; primarily, regions of horizontal
titles and captions. 

We first apply a 3x3 horizontal differential filter to the entire image with appropriate binary 
thresholding for extraction of vertical edge features. Smoothing filters are then used to eliminate 
extraneous fragments, and to connect character sections that may have been detached. Individual 
regions are identified by cluster detection and their bounding rectangles are computed. Clusters 
with bounding regions that satisfy the following constraints are selected:


Cluster Size > 70 pixels

Cluster Fill Factor ≥ 0.45

Horizontal-Vertical Aspect Ratio ≥ 0.75


A cluster’s bounding region must have a large horizontal-to-vertical aspect ratio as well as
satisfying various limits in height and width. The fill factor of the region should be high to
ensure dense clusters. The cluster size should also be relatively large to avoid small fragments.
An intensity histogram of each region is used to test for high contrast. This is because certain
textures and shapes appear similar to text but exhibit low contrast when examined in a bounded
region. Finally, consistent detection of the same region over a certain period of time is also
tested since text regions are placed at the exact position for many video frames. Figure 6 shows
detection examples of words and subsets of a word. Table 2 presents statistics for detection on
various sets of images.
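
A minimal sketch of the geometric cluster test follows. The `Cluster` record and its fields are hypothetical, "cluster size" is assumed to mean the number of edge pixels in the cluster, and the fill factor is assumed to be edge pixels divided by bounding-rectangle area. The fill-factor and aspect-ratio limits come from the constraints listed above; the size limit is left as a required argument because the caller should supply it. The contrast and temporal-consistency checks are not included.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    width: int        # bounding-rectangle width in pixels (assumed > 0)
    height: int       # bounding-rectangle height in pixels (assumed > 0)
    edge_pixels: int  # number of edge pixels belonging to the cluster

def is_text_candidate(cluster, min_size, min_fill=0.45, min_aspect=0.75):
    """Apply the size, fill-factor and aspect-ratio constraints to one cluster."""
    fill = cluster.edge_pixels / (cluster.width * cluster.height)
    aspect = cluster.width / cluster.height  # horizontal-to-vertical ratio
    return (cluster.edge_pixels > min_size
            and fill >= min_fill
            and aspect >= min_aspect)

caption = Cluster(width=120, height=20, edge_pixels=1400)   # wide and dense
pole = Cluster(width=10, height=40, edge_pixels=300)        # tall and narrow
sparse = Cluster(width=100, height=20, edge_pixels=500)     # wide but sparse
print([is_text_candidate(c, min_size=50) for c in (caption, pole, sparse)])
```

Only the wide, dense cluster passes; the tall cluster fails the aspect-ratio test and the sparse one fails the fill-factor test.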


Table 2: Text Region Detection Results 
3 Technology Integration and Skim Creation

We have characterized video by scene breaks, camera motion, object appearance and keywords.
Skim creation involves selecting the appropriate keywords and choosing a corresponding set of
images. Candidates for the image portion of the skim are selected by two types of rules: 1)
Primitive Rules, independent rules that provide candidates for the selection of image regions for a
given keyword, and 2) Meta-Rules, higher order rules that select a single candidate from the
primitive rules according to global properties of the video. The subsections below describe the
steps involved in the selection, prioritizing and ordering of the keywords and video frames.

Figure 6: Text detection results with various images.

3.1 Audio Skim 

The first level of analysis for the skim is the creation of the reduced audio track, which is 
based on the keywords. Those words whose TF-IDF values are higher than a fixed threshold are
selected as keywords. By varying this threshold, we control the number of keywords, and thus,
the length of the skim. The length of the audio track is determined by a user specified compaction 
level. 

Keywords that appear in close proximity or repeat throughout the transcript may create skims
with redundant audio. Therefore, we discard keywords which repeat within a minimum number of 
frames (150 frames) and limit the repetition of each word. 
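
The redundancy filter can be illustrated with a short sketch. The 150 frame gap is taken from the text; the repetition cap `max_repeats`, the choice to measure the gap from the last kept occurrence, and the function name are assumptions, since the text does not give those details.

```python
def prune_keywords(occurrences, min_gap=150, max_repeats=2):
    """Drop keyword occurrences that repeat too soon or too often.

    occurrences : list of (frame, word) pairs sorted by frame number
    min_gap     : minimum number of frames between kept repeats of a word
    max_repeats : assumed cap on how many times one word may be kept
    """
    last_kept_frame = {}
    kept_count = {}
    kept = []
    for frame, word in occurrences:
        too_close = (word in last_kept_frame
                     and frame - last_kept_frame[word] < min_gap)
        too_many = kept_count.get(word, 0) >= max_repeats
        if not too_close and not too_many:
            kept.append((frame, word))
            last_kept_frame[word] = frame
            kept_count[word] = kept_count.get(word, 0) + 1
    return kept

hits = [(0, "species"), (100, "species"), (200, "extinction"),
        (300, "species"), (900, "species")]
print(prune_keywords(hits))
```

Here the occurrence at frame 100 is dropped for being within 150 frames of the previous one, and the occurrence at frame 900 is dropped because the word has already been kept twice.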

Our experiments have shown that using individual keywords creates an audio skim which is
fragmented and incomprehensible for some speakers. To increase comprehension, we use longer
audio sequences, “keyphrases”, in the audio skim. A keyphrase is obtained by starting with a
keyword, and extending its boundaries to areas of silence or neighboring keywords. Each keyphrase
is isolated from the original audio track to form the audio skim. The average keyphrase lasts 2 
seconds. 
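
One way to picture keyphrase extraction is sketched below. It assumes a time-aligned transcript of `(text, start_frame, end_frame)` tuples, treats any gap of at least `silence_gap` frames between words as silence, and stops just before a neighboring keyword so that keyphrases do not overlap. All three choices, and the names used, are assumptions for illustration.

```python
def keyphrase_bounds(words, key_index, keyword_indices, silence_gap=10):
    """Return (start_frame, end_frame) of the keyphrase grown around one keyword.

    words           : list of (text, start_frame, end_frame) in time order
    key_index       : index of the keyword to grow
    keyword_indices : indices of all keywords in `words`
    silence_gap     : assumed gap, in frames, that counts as silence
    """
    others = set(keyword_indices) - {key_index}
    left = key_index
    while (left > 0 and left - 1 not in others
           and words[left][1] - words[left - 1][2] < silence_gap):
        left -= 1
    right = key_index
    while (right + 1 < len(words) and right + 1 not in others
           and words[right + 1][1] - words[right][2] < silence_gap):
        right += 1
    return words[left][1], words[right][2]

transcript = [("many", 0, 8), ("species", 10, 25), ("are", 27, 32),
              ("dying", 34, 50), ("today", 70, 85)]
print(keyphrase_bounds(transcript, 1, [1, 3]))  # grows to "many species are"
print(keyphrase_bounds(transcript, 3, [1, 3]))  # grows to "are dying"
```

The second phrase stops before "today" because the 20 frame pause exceeds the assumed silence gap.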

3.2 Video Skim Candidates 

In order to create the image skim, we might think of selecting those video frames that
correspond in time to the audio skim segments. As we often observe in television programs, however,
the contents of the audio and video are not necessarily synchronized. Therefore, for each keyword 
or keyphrase we must analyze the characterization results of the surrounding video frames and 
select a set of frames which may not align with the audio in time, but which are most appropriate 
for skimming. 

To study the image selection process of skimming, we manually created skims for 5 hours of 
video with the help of producers and technicians in Carnegie Mellon's Drama Department. The
study revealed that while perfect skimming requires semantic understanding of the entire video,
certain parts of the image selection process can be automated with current image understanding.
By studying these examples and video production standards [14], we can identify an initial set of
heuristic rules.

The first heuristics are the primitive rules, which are tested with the video frames in the scene
containing the keyword/keyphrase, and the scenes that follow within at least a 5 second window.
A description of each primitive rule is given in order of priority below. The four rows above
“Skim Candidates”, in Figure 7, indicate the candidate image sections selected by various
primitive rules.

1. Introduction Scenes(INS)

The scenes prior to the introduction of a proper name usually describe a person’s
accomplishment and often precede scenes with large views of the person’s face. If a keyphrase contains a
proper name, and a large human face is detected within the surrounding scenes, then we set the 
face scene as the last frame of the skim candidate and use the previous frames for the beginning. 

2. Similar Scenes(SIS) 

The histogram technology in scene segmentation gives us a simple routine for detecting simi- 
larity between scenes. Scenes between successive shots of a human face usually imply illustration 
of the subject. For example, a video producer will often interleave shots of research between shots 
of a scientist. Images between similar scenes that are less than 5 seconds apart are used for
skimming.

3. Short Sequences(SHS) 

Short successive shots often introduce a more important topic. By measuring the duration of 
each scene, we can detect these regions and identify “short shot” sequences. The video frames
that follow these sequences and the exact sequence are used for skimming. 

4. Object Motion(OBM) 

Object motion is important simply because video producers usually include this type of footage
to show something in action. We are currently exploring ways to detect object motion in video. 

5. Bounded Camera Motion(BCM/ZCM)

The video frames that precede or follow a pan or zoom motion are usually the focus of the
segment. We can isolate the video regions that are static and bounded by segments with motion,
and therefore likely to be the focal point in a scene containing motion.

6. Human Faces and Captions(TXT/FAC)

A scene will often contain recognizable humans, as well as captioned text to describe the 
scene. If a scene contains both faces and text, the portion containing text is used for skimming. A 
lower level of priority is given to the scenes with video frames containing only human-faces or
text. For these scenes priority is given to text. 
7. Significant Audio(AUD)

If the audio is music, then the scene may not be used for skimming. Soft music is often used as 
a transitional tool, but seldom accompanies images of high importance. High audio levels (e.g.
loud music, explosions) may imply an important scene is about to occur. The skim region will
start after high audio levels or music. 

8. Default Rule(DEF)

Default video frames align to the audio keyphrases. 
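
Because the primitive rules are listed in order of priority, picking among candidates reduces to a ranked lookup. The sketch below assumes each rule has already produced a candidate region as a `(rule, start_frame, end_frame)` tuple; the rule codes follow the abbreviations in the list above, with the first rule written as `INS` (an assumed reading), and ties are broken by the earlier start frame, which is also an assumption.

```python
# Rule codes in descending priority, following the order of the list above.
PRIORITY = ["INS", "SIS", "SHS", "OBM", "BCM/ZCM", "TXT/FAC", "AUD", "DEF"]

def best_candidate(candidates):
    """Return the candidate produced by the highest-priority rule.

    candidates : non-empty list of (rule, start_frame, end_frame) tuples
    """
    rank = {rule: position for position, rule in enumerate(PRIORITY)}
    return min(candidates, key=lambda c: (rank[c[0]], c[1]))

found = [("DEF", 100, 160), ("TXT/FAC", 130, 170), ("BCM/ZCM", 140, 200)]
print(best_candidate(found))
```

The default region is chosen only when no other rule fires, since `DEF` is ranked last.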

3.3 Image Adjustments

With prioritized video frames from each scene, we now have a suitable representation for 
combining the image and audio skims for the final skim. A set of higher order Meta-Rules are 
used to complete skim creation. 

For visual clarity and comprehension, we allocate at least 30 video frames to a keyphrase. The
30 frame minimum for each scene is based on empirical studies of visual comprehension in short
video sequences. When a keyphrase is longer than 60 video frames, we include frames from skim
candidates of adjacent scenes within the 5 second search window. The final skim borders are
adjusted to avoid image regions that overlap or continue into adjacent scenes by less than 30 
frames. 

To avoid visual redundancy, we reduce the presence of human faces and default image regions 
in the skim. If the highest ranking skim candidate for a keyphrase is the default, we extend the 
search range to a 10 second window and look for other candidates. The human face rule is limited 
if the segment contains several interviews. Interview scenes can be extremely long, so we look for 
other candidates in a 15 second search window. 
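
The window-extension part of the Meta-Rules can be sketched as below. This is a simplified illustration under stated assumptions: `find_candidates` is a hypothetical callback returning candidates, best first, for a given search window in seconds; `FAC` is a hypothetical code for a face-only candidate; and the sketch neither checks that the chosen region is long enough nor handles the adjacent-scene and border adjustments.

```python
MIN_FRAMES = 30  # minimum number of video frames allocated to a keyphrase

def choose_region(find_candidates, keyphrase_frames, many_interviews=False):
    """Pick an image region for one keyphrase, widening the search when needed.

    find_candidates  : callable(window_seconds) -> list of (rule, start, end),
                       ordered best first (hypothetical interface)
    keyphrase_frames : length of the audio keyphrase in frames
    many_interviews  : True if the segment contains several interviews
    """
    best = find_candidates(5)[0]
    if best[0] == "DEF":
        # Default region ranked highest: look further, in a 10 second window.
        alternatives = [c for c in find_candidates(10) if c[0] != "DEF"]
        if alternatives:
            best = alternatives[0]
    elif best[0] == "FAC" and many_interviews:
        # Limit faces in interview-heavy segments: try a 15 second window.
        alternatives = [c for c in find_candidates(15) if c[0] != "FAC"]
        if alternatives:
            best = alternatives[0]
    rule, start, _end = best
    return rule, start, start + max(MIN_FRAMES, keyphrase_frames)

by_window = {5: [("DEF", 100, 120)],
             10: [("DEF", 100, 120), ("BCM/ZCM", 300, 380)]}
print(choose_region(lambda seconds: by_window[seconds], keyphrase_frames=20))
```

In the example the default region is replaced by a bounded-camera-motion region found in the wider window, and the 20 frame keyphrase is padded to the 30 frame minimum.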

Figure 8 illustrates the adjustment and final selection of video skims. It shows how and why
the image segments, which do not necessarily correspond in time to the audio segments, are
selected.
3.4 Example Results

Figure 9 shows the video frames and audio from the “Planet Earth” video. The image portion
of the skim has captured information from 18 of the 64 total scenes in the video. With the
exception of the scene at frame 585, which lasts over 1,300 frames in the original video, most scenes
are small and provide maximum visual information. An error in scene segmentation, near frame
702, causes this scene to split and, therefore, it is used twice for separate keyphrases. The final
scene in the original video is long and contains two keyphrases. In this case, the search window
cannot extend to other scenes and these keyphrases must share image frames from the final scene.
Introduction scenes, bounded camera motion and human faces dominate the image skims for this
segment.

Figure 10 shows another example from the “Planet Earth” video with 16 of the 37 scenes
represented. This segment contains many long outdoor scenes that provide little information.
However, most primitive rules do not match these scenes so the search window is extended and they
appear less frequently in the image skim. The scene at frame 828 is an interview scene which
contains 3 keyphrases and lasts several frames. Even with an extended search window, the scenes that
follow do not match any of the primitive rules so the image skim is rather long for this scene.

Figure 11 shows two types of skims for the “Mass Extinction” segment. Skim A was produced
with our method of integrated image and language understanding. Skim B was created by selecting
video and audio portions at fixed intervals. This segment contains 71 scenes, of which skim A
has captured 23 scenes, and skim B has captured 17 scenes. Studies involving different skim
creation methods are discussed in the next section.

Skim A has only 1632 frames, while the first scene of the original segment is an interview that 
lasts 1734 frames. The scenes that follow this interview contain camera motion, so we select them 
for the keyphrases towards the end of the scene. Charts and figures interleaved between
successive human subjects are selected for the latter scenes.


Table 3: Skim Compaction Data

MC - Manually Assisted Characterization    AC - Automated Characterization
MS - Manual Skim Creation                  AS - Automated Skim Creation

3.5 User Evaluation 

The results of several skims are summarized in Table 3. The manually created skims in the
initial stages of the experiment help test the potential visual clarity and comprehension of skims.
The compaction ratio for a typical segment is 10:1, and it was shown that skims with compaction as
high as 20:1 still retain most of the content. Our results show the information representation
potential of skims, but we must test our work with human subjects to study its effectiveness.

We are conducting a user-study to test the content summarization and effectiveness of the 
skim as a browsing tool in a video library. Subjects must navigate a video library to answer a 
series of questions. The effectiveness of each skim is based on the time to complete this task and
the number of correct items retrieved. Although our evaluation results are tentative, the skim does
appear to be an effective tool for browsing, as evidenced by the difference in time that subjects
spend in skim mode versus regular playback mode. 

We use various types of skims to test the utility of image and language understanding in skim 
creation. The following creation schemes are presently being tested: 

A - Image and Language Characterization
B - Fixed Intervals (Default)
C - Language Characterization Only
D - Image Characterization Only

Figure 11 shows examples of skim type A and B. The visual information in skim A is less
redundant and provides a greater variety of scenes. The audio for skim B is incoherent and
considerably smaller. Although our skim does appear to provide more information, additional testing
is needed.
4 Conclusions

The emergence of high volume video libraries has shown a clear need for content-specific
video-browsing technology. We have described an algorithm to create skim videos that consist of
content rich audio and video information. Compaction of video as high as 20:1 has been achieved
without apparent loss in content.

While the generation of content-based skims presented in this paper is very limited due to the
fact that the true understanding of video frames is extremely difficult, it illustrates the potential
power of integrated language and image information for characterization in video retrieval and
browsing applications.